Large Language Model
After the AI binge, companies balk at soaring bills
New York - Artificial intelligence is getting expensive -- and companies are starting to rethink their embrace of the disruptive technology. Playing by a well-worn Silicon Valley playbook, AI companies charged rock-bottom prices to hook customers after ChatGPT burst onto the scene. Kevin Simback of startup incubator Delphi Labs calls it the era of "subsidized intelligence" -- meaning investors were basically footing the bill so companies could offer AI on the cheap.
Do You Actually Need to Pay for Transcription Software?
I tested Wispr Flow and various AI-powered transcription software to see whether you should bother subscribing or stick with free services. The pitch -- that you'll be able to write faster by talking out loud instead of typing -- is compelling, especially if you're a slow typist. The marketing promises you'll be able to write at the speed of thought, 4x faster than your keyboard. I already type faster than I can think.
Hands-On With Gemini Spark: I Gave It Access to My Life and It Friend-Zoned My Boyfriend
Google's new AI agent combed through my emails, documents, and calendar to plan a birthday party and still didn't clock the person most important to me. At its recent I/O developer conference, Google introduced Gemini Spark as an always-on agent that connects to your personal data, completes online tasks, and automates aspects of your daily interactions. It's Google's take on the viral OpenClaw agent that rocked Silicon Valley at the start of 2026. OpenClaw's early adopters handed their entire lives over to an AI agent for messaging and scheduling automation -- sometimes with bot-induced mishaps causing embarrassing results.
The Download: unlocking lithium and controlling Ebola
Plus: Anthropic is now valued higher than OpenAI. How a new extraction process could unlock the world's lithium: A new method for extracting lithium could cut costs and emissions from one of the world's most important materials for EVs and energy storage. The technique uses a weak acid to dissolve silicate minerals. That frees not only the lithium but also other useful materials, including alumina and silica. "At scale, we believe this will be the lowest-cost way of sourcing lithium in the world," says Yet-Ming Chiang, an MIT professor who co-authored a study of the process published yesterday. Startup Rock Zero is already working to commercialize the research.
Google's best new AI feature is just a really good to-do list
PCWorld highlights Google's Gemini Daily Brief as a standout AI feature that creates personalized to-do lists by scanning Gmail, Google Calendar, and recent chats. Available on Google's AI Pro and Ultra plans, the feature provides actionable buttons like "add to calendar" and "mark complete" for enhanced task management. While Google I/O introduced many AI announcements with limited immediate impact, Daily Brief proves genuinely useful for organizing daily commitments and appointments. Google's big I/O event came and went last week, stuffed to the gills with new AI announcements and functionality. Most of it left me cold. But one -- and only one -- of those Gemini announcements is actually making a difference for me in the week following Google I/O, and it's relatively humble: Daily Brief, a Gemini-generated daily to-do list based on your Google Workspace data.
Anthropic soars to $965bn valuation, leapfrogging OpenAI
Anthropic has usurped OpenAI as the world's most valuable artificial intelligence startup, soaring to a $965bn valuation ahead of expected public listings by the rival firms. Anthropic, the maker of the Claude family of chatbots, said on Thursday that it had raised $65bn from private investors after a fundraising round led by Altimeter Capital, Greenoaks, Dragoneer and Sequoia Capital. "This funding will help us serve the historic demand we are experiencing, stay at the research frontier, and bring Claude to more of the places where work happens," Anthropic's Chief Financial Officer Krishna Rao said in a statement. Altimeter Capital CEO Brad Gerstner hailed the adoption of Claude among the "world's most demanding organisations" as evidence of Anthropic's command in the field. "This momentum positions Anthropic to lead the next phase of AI innovation and capture the enormous opportunity ahead," Gerstner said.
Conf-Gen: Conformal Uncertainty Quantification for Generative Models
Loaiza-Ganem, Gabriel, Zhang, Kevin, Cui, Wei, Law, Marc T., Leung, Kin Kwan
Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (Conf-Gen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.
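For readers unfamiliar with the baseline that this abstract builds on, the following is a minimal sketch of standard split conformal prediction, not of Conf-Gen itself. It assumes nonconformity scores where lower means a better fit; the function names, the example scores, and the labels are all hypothetical.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    With exchangeable data, a fresh score falls at or below this threshold
    with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set must be trivial.
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return sorted(label for label, score in candidate_scores.items()
                  if score <= threshold)

# Hypothetical calibration scores and candidates, for illustration only.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.90]
threshold = conformal_threshold(cal_scores, alpha=0.25)
chosen = prediction_set({"cat": 0.15, "dog": 0.55, "fox": 0.95}, threshold)
```

The finite-sample correction (n + 1 rather than n) is what gives the guarantee without distributional assumptions; how Conf-Gen relaxes or adapts this for generative outputs is described in the paper, not here.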
Anytime-Valid Federated Conformal RAG for LLM Swarms
Dubey, Prasanjit, Huo, Xiaoming
Federated Conformal RAG (FC-RAG) provides distribution-free coverage for a bandwidth-limited swarm of weak language models, but only at a fixed horizon. We extend it to anytime-valid sequential coverage: validity at every stopping time, preserved under predictable adaptive control (recalibration, per-node bandwidth escalation, distilled-student refresh), at no extra cost in assumptions over fixed-horizon FC-RAG. Naive composition fails because FC-RAG's marginal coverage bound makes the betting e-process a non-supermartingale on adverse calibration draws, and Ville's inequality cannot be invoked. We give Anytime-FC-RAG, a sequential extension built on a summable per-step calibration-deviation budget that converts the marginal bound into a strict conditional bound on a calibration-good event, paired with a truncated betting e-process that is a nonnegative supermartingale on the entire probability space. From these two ingredients, we obtain four guarantees: time-uniform alarm validity $\mathbb{P}(\sup_t E_t \ge 1/\delta_e) \le \delta_e + \delta_{\mathrm{cal}}$, a Hoeffding-stitched cumulative-miscoverage envelope at the same total budget, safety under any predictable controller (recalibration, bandwidth escalation, student refresh), and training-side error propagation across an unbounded sequence of Federated Probe-Logit Distillation (FPLD) refreshes via a summable training budget. As a practical consequence, an adaptive controller that escalates retrieval bandwidth only when the e-process crosses a warning threshold matches the alarm rate of a fixed-high-bandwidth schedule at substantially lower communication cost. Experiments on a GPT-2-small + MiniLM swarm across MMLU, DBpedia, and AG News verify the predicted alarm rate, detection delay, envelope coverage, and $14$-$57\%$ bandwidth savings; the alarm fires when and only when coverage genuinely breaks.
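To make the alarm rule concrete, here is a minimal sketch of a generic betting e-process for miscoverage, under the simplifying assumption that the conditional miscoverage probability is at most `alpha` at every step. It does not reproduce the paper's truncation or its calibration-deviation budget; the function name, the fixed bet size, and the parameter values are hypothetical.

```python
def betting_e_process(miss_indicators, alpha, bet, delta):
    """Track wealth E_t = prod_s (1 + bet * (X_s - alpha)) over a 0/1 stream.

    X_s = 1 marks a miscovered step. If each X_s has conditional mean at most
    alpha and 0 <= bet <= 1/alpha, E_t is a nonnegative supermartingale, so
    Ville's inequality bounds P(sup_t E_t >= 1/delta) by delta.
    """
    if not 0.0 <= bet <= 1.0 / alpha:
        raise ValueError("bet must lie in [0, 1/alpha] to keep wealth nonnegative")
    wealth = 1.0
    path = []
    alarm_step = None
    for step, miss in enumerate(miss_indicators, start=1):
        wealth *= 1.0 + bet * (miss - alpha)
        path.append(wealth)
        # Raise the alarm the first time wealth reaches 1/delta.
        if alarm_step is None and wealth >= 1.0 / delta:
            alarm_step = step
    return path, alarm_step

# A stream with no misses loses a little wealth each step and never alarms.
_, alarm_ok = betting_e_process([0] * 50, alpha=0.1, bet=0.5, delta=0.05)

# A stream where every step is miscovered grows wealth until the alarm fires.
_, alarm_broken = betting_e_process([1] * 20, alpha=0.1, bet=0.5, delta=0.05)
```

The abstract's point is that this supermartingale property fails when only a marginal coverage bound is available, which is why the paper pairs a truncated e-process with a calibration-good event and pays the extra $\delta_{\mathrm{cal}}$ term.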
On the Optimizer Dependence of Neural Scaling Laws
Ramani, Vansh, Jain, Shourya Vir
The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the $\alpha$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
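The "compounds with each model-size doubling" remark can be checked with a little arithmetic. The sketch below assumes a pure power law $L(N) = c\,N^{-\alpha}$ with no irreducible-loss term; the constant `3.0`, the model sizes, and the function names are hypothetical, and the losses are synthetic rather than taken from the paper.

```python
import math

def fit_scaling_exponent(sizes, losses):
    """Least-squares slope of log L against log N; returns alpha = -slope."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var

def loss_ratio_per_doubling(alpha):
    """Under L(N) = c * N**(-alpha), doubling N multiplies the loss by 2**(-alpha)."""
    return 2.0 ** (-alpha)

# Synthetic losses generated from an exact power law with alpha = 0.31.
sizes = [2 ** k for k in range(6, 12)]
synthetic_losses = [3.0 * n ** -0.31 for n in sizes]
alpha_hat = fit_scaling_exponent(sizes, synthetic_losses)

ratio_natural_gradient = loss_ratio_per_doubling(0.31)   # roughly 0.81
ratio_gradient_descent = loss_ratio_per_doubling(0.12)   # roughly 0.92
```

Under this idealized model, each doubling cuts the loss by about 19% at $\alpha \approx 0.31$ versus about 8% at $\alpha \approx 0.12$, and the gap widens multiplicatively over successive doublings.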
Low Rank for Rank: Uncertainty-Aware Task-Specific LLM Ranking under Sparse Pairwise Comparisons
Li, Jiachun, Simchi-Levi, David, Sun, Will Wei
Pairwise human-preference platforms such as Chatbot Arena have become central to large language model (LLM) evaluation, yet reliable task-specific ranking remains challenging. Global leaderboards mask task heterogeneity, while ranking each fine-grained task independently is unstable under sparse, imbalanced comparisons. We propose a low-rank framework for task-specific LLM ranking from sparse pairwise comparisons, modeling the task-by-model ability matrix $\Theta^\star \in \mathbb{R}^{d_t \times d_m}$ as low rank so that information is shared across related tasks while task-specific differences are preserved. We first develop a max-norm ($\ell_\infty$) accurate estimator for the latent scores, combining a convex initializer with alternating-minimization refinement, and prove task-wise top-$K$ recovery guarantees under sparse sampling. Our main contribution is an uncertainty quantification framework for task-specific ranking. We construct cross-fitted one-step debiased estimators for fixed score contrasts -- such as the task-specific ability gap between two models -- yielding asymptotically valid confidence intervals that attain the semiparametric efficiency bound. We then extend the inference to the high-dimensional ranking regime, where per-task ranks and top-$K$ membership are determined by many dependent score-gap hypotheses. Using Gaussian and multiplier-bootstrap calibration, we obtain simultaneous confidence sets for per-task ranks and valid top-$K$ membership tests across many tasks and models. Experiments on synthetic data and Chatbot Arena show that low-rank sharing improves sample efficiency over independent task-wise Bradley-Terry estimation and produces tighter, better-calibrated ranking certificates, with the largest gains in the sparse regime typical of real LLM benchmarks.
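As a rough illustration of the modeling idea, the sketch below builds a task-by-model score matrix from low-rank factors and turns a score gap into a win probability. It assumes a Bradley-Terry (logistic) comparison model, which the abstract names only as the baseline, so the link function here is an assumption; the factor values and function names are hypothetical, and no estimation or inference step from the paper is reproduced.

```python
import math

def low_rank_scores(task_factors, model_factors):
    """Build theta[t][m] as the inner product of a task factor and a model factor."""
    return [[sum(u * v for u, v in zip(task, model)) for model in model_factors]
            for task in task_factors]

def win_probability(theta, task, model_a, model_b):
    """Bradley-Terry style probability that model_a beats model_b on a task."""
    gap = theta[task][model_a] - theta[task][model_b]
    return 1.0 / (1.0 + math.exp(-gap))

# Rank-1 toy example: two tasks, two models (all values hypothetical).
theta = low_rank_scores(task_factors=[[1.0], [2.0]],
                        model_factors=[[0.5], [-0.5]])
p_task0 = win_probability(theta, task=0, model_a=0, model_b=1)
p_task1 = win_probability(theta, task=1, model_a=0, model_b=1)
```

Because both tasks share the same model factors, comparisons observed on one task inform the scores on the other, which is the sharing effect the abstract credits for better sample efficiency in sparse regimes.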